The truth is almost never linear!
But often the linearity assumption is “good enough”
What about when it's not?
Polynomials
Step Functions
Splines
Local Regression
Generalized Additive Models
Create new variables \(X_1 = X\), \(X_2 = X^2\), and so on, then treat as multiple linear regression
Not really interested in the coefficients; more interested in the fitted function values at any value \(x_0\):
Since \(\hat{f}(x_0)\) is a linear function of the \(\hat{\beta}_\ell\), we can get a simple expression for the pointwise variance \(\mathrm{Var}[\hat{f}(x_0)]\) at any value \(x_0\). In the figure above, we have computed the fit and pointwise standard errors on a grid of values for \(x_0\). We show \(\hat{f}(x_0) \pm 2 \cdot \mathrm{se}[\hat{f}(x_0)]\)
We either fix the degree \(d\) at some reasonably low value, or else use cross-validation to choose \(d\)
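As a concrete sketch (assuming the Wage data from the ISLR package, which the figures appear to use), a degree-4 fit with its \(\pm 2\) SE bands:

```r
# Degree-4 polynomial regression with pointwise +/- 2 SE bands
# (a minimal sketch, assuming the ISLR Wage data)
library(ISLR)
fit <- lm(wage ~ poly(age, 4), data = Wage)
grid <- data.frame(age = seq(min(Wage$age), max(Wage$age), length.out = 100))
pred <- predict(fit, newdata = grid, se.fit = TRUE)
bands <- cbind(pred$fit - 2 * pred$se.fit, pred$fit + 2 * pred$se.fit)
```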
Logistic regression follows naturally. For example, in the figure we model:
\(\Pr(y_i > 250 \mid x_i) = \frac{\exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d)}{1 + \exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d)}\)
To get confidence intervals, compute upper and lower bounds on the logit scale, and then invert to get them on the probability scale
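A sketch of this logit-then-invert computation (again assuming the ISLR Wage data):

```r
# Polynomial logistic regression; bounds are built on the logit scale
# and then inverted to the probability scale (sketch, ISLR Wage data)
fit <- glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)
grid <- data.frame(age = 20:80)
pred <- predict(fit, newdata = grid, se.fit = TRUE)  # logit scale by default
upper <- pred$fit + 2 * pred$se.fit
lower <- pred$fit - 2 * pred$se.fit
prob_upper <- exp(upper) / (1 + exp(upper))          # invert to probabilities
prob_lower <- exp(lower) / (1 + exp(lower))
```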
Can do separately on several variables—just stack the variables into one matrix, and separate out the pieces afterwards (see GAMs later)
Caveat: polynomials have notorious tail behavior — very bad for extrapolation
Can fit using y ~ poly(x, degree = 3) in the formula
Another way of creating transformations of a variable — cut the variable into distinct regions
Easy to work with. Creates a series of dummy variables representing each group
Useful way of creating interactions that are easy to interpret. For example, interaction effect of Year and Age:
\(I(\text{Year} < 2005) \cdot \text{Age}, \quad I(\text{Year} \geq 2005) \cdot \text{Age}\)
Would allow for a different linear function of Age in each Year category (see the sketch below)
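A minimal sketch of step functions via cut(), which builds the dummy variables automatically (ISLR Wage data assumed):

```r
# Step function fit: cut() bins age into four regions, and lm() treats the
# resulting factor as a set of dummy variables (sketch, ISLR Wage data)
fit_step <- lm(wage ~ cut(age, 4), data = Wage)
coef(summary(fit_step))
```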
Choice of cutpoints or knots can be problematic. For creating nonlinearities, smoother alternatives such as splines are available
Instead of a single polynomial in \(X\) over its whole domain, we can rather use different polynomials in regions defined by knots
Better to add constraints to the polynomials, e.g. continuity
Splines have the “maximum” amount of continuity
Linear Splines
A linear spline with knots at \(\xi_k, k = 1,...,K\) is a piecewise linear polynomial continuous at each knot
We can represent this model as:
\(y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+1} b_{K+1}(x_i) + \epsilon_i\)
Where the \(b_k\) are basis functions:
\(b_1(x_i) = x_i\)
\(b_{k+1}(x_i) = (x_i - \xi_k)_+, \quad k = 1, \dots, K\)
Here \((\cdot)_+\) means positive part: \((x_i - \xi_k)_+ = \begin{cases} x_i - \xi_k & \text{if } x_i > \xi_k \\ 0 & \text{otherwise} \end{cases}\)
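The basis is easy to build by hand; a sketch on simulated data (knot locations are arbitrary, for illustration only):

```r
# Hand-rolled linear spline: b_1(x) = x plus one truncated term per knot
set.seed(1)
x <- sort(runif(200, 0, 100))
y <- sin(x / 15) + rnorm(200, sd = 0.3)
knots <- c(30, 60)
B <- sapply(knots, function(k) pmax(x - k, 0))  # the (x - xi_k)_+ columns
fit_ls <- lm(y ~ x + B)                         # piecewise linear, continuous at knots
```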
Cubic Splines
A cubic spline with knots at \(\xi_k, k = 1,...,K\) is a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot
Again, we can represent this model with truncated power basis functions
\(y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i\)
\(b_1(x_i) = x_i\)
\(b_2(x_i) = x_i^2\)
\(b_3(x_i) = x_i^3\)
\(b_{k+3}(x_i) = (x_i - \xi_k)_+^3, k = 1,...,K\)
Where \((x_i - \xi_k)_+^3 = \begin{cases} (x_i - \xi_k)^3 & \text{if } x_i > \xi_k \\ 0 & \text{otherwise} \end{cases}\)
Natural Cubic Splines
A natural cubic spline is a cubic spline constrained to be linear beyond the boundary knots; the extra constraints free up degrees of freedom, allowing more interior knots for the same budget
Fitting splines in R is easy
bs(x, ...) for splines of any degree, and ns(x, ...) for natural cubic splines, both in the splines package
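A sketch of both (knot locations chosen for illustration; ISLR Wage data assumed):

```r
library(splines)
# Cubic spline and natural cubic spline with the same interior knots
fit_bs <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
fit_ns <- lm(wage ~ ns(age, knots = c(25, 40, 60)), data = Wage)
```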
Knot Placement
One strategy is to decide \(K\), the number of knots, and then place them at appropriate quantiles of the observed \(X\)
A cubic spline with K knots has \(K + 4\) parameters or degrees of freedom
A natural spline with \(K\) knots has \(K\) degrees of freedom
Below is a comparison of a degree-14 polynomial, poly(age, degree = 14), and a natural cubic spline, ns(age, df = 14), each with 15 degrees of freedom
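In code (a sketch, again assuming the ISLR Wage data):

```r
# The two fits being compared, each using 15 degrees of freedom
fit_poly <- lm(wage ~ poly(age, degree = 14), data = Wage)
fit_nat  <- lm(wage ~ ns(age, df = 14), data = Wage)
```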
Smoothing Splines
Consider this criterion for fitting a smooth function \(g(x)\) to some data:
\(\underset{g \in \mathcal{S}}{\text{minimize}} \ \sum_{i=1}^{n}(y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt\)
The first term is RSS and tries to make \(g(x)\) match the data at each \(x_i\)
The second term is a roughness penalty and controls how wiggly \(g(x)\) is. It is modulated by the tuning parameter \(\lambda \geq 0\)
The smaller \(\lambda\), the more wiggly the function, eventually interpolating \(y_i\) when \(\lambda = 0\)
As \(\lambda \to \infty\), the function \(g(x)\) becomes linear
The solution is a natural cubic spline, with a knot at every unique value of \(x_i\). The roughness penalty still controls the roughness via \(\lambda\)
Smoothing splines avoid the knot-selection issue, leaving a single \(\lambda\) to be chosen
The algorithmic details are beyond the scope of this course. In R, the function smooth.spline() will fit a smoothing spline
The vector of \(n\) fitted values can be written as \(\mathbf{\hat{g}}_\lambda = \mathbf{S}_\lambda\mathbf{y}\), where \(\mathbf{S}_\lambda\) is an \(n \times n\) matrix (determined by the \(x_i\) and \(\lambda\))
The effective degrees of freedom are given by: \(df_\lambda = \sum_{i=1}^{n} \left\{\mathbf{S}_\lambda \right\}_{ii}\)
Choosing \(\lambda\)
We can specify \(df\) rather than \(\lambda\)
The leave-one-out (LOO) cross-validated error is given by: \(RSS_{cv}(\lambda) = \sum_{i=1}^{n}(y_i - \hat{g}_\lambda^{-i}(x_i))^2 = \sum_{i=1}^{n}\left [ \frac{y_i- \hat{g}_\lambda(x_i)}{1-\left\{\mathbf{S}_\lambda \right\}_{ii}} \right ]^2\)
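Both routes are available in smooth.spline(): fix the effective df, or set cv = TRUE for leave-one-out cross-validation (a sketch, ISLR Wage data assumed; ties in age trigger a warning under cv = TRUE):

```r
# Smoothing spline: fix the effective df, or let LOO CV choose lambda
fit_df <- smooth.spline(Wage$age, Wage$wage, df = 16)
fit_cv <- smooth.spline(Wage$age, Wage$wage, cv = TRUE)
fit_cv$df  # effective degrees of freedom chosen by cross-validation
```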
Local Regression
Local regression can be fit with the loess() function in R
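A minimal call (ISLR Wage data assumed; span controls the fraction of data in each local neighborhood):

```r
# Local regression: each fitted value uses the nearest 50% of the data
fit_lo <- loess(wage ~ age, span = 0.5, data = Wage)
```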
Generalized Additive Models
Allows for flexible nonlinearities in several variables, but retains the additive structure of linear models
Can fit GAM simply using natural splines:
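For example (a sketch; the df values and the Wage data are assumptions following common ISLR usage):

```r
# A GAM fit with plain lm(): natural-spline bases for each quantitative term
gam_ns <- lm(wage ~ ns(year, df = 4) + ns(age, df = 5) + education, data = Wage)
```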
Coefficients are not that interesting; the fitted functions are. The previous plot was produced using plot.gam()
Can mix terms, some linear and some nonlinear, and use anova() to compare models (see the sketch below)
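A sketch of such a comparison (assuming the gam package and the ISLR Wage data):

```r
# anova() comparison: is a nonlinear term in year needed?
library(gam)
m1 <- gam(wage ~ year + s(age, df = 5) + education, data = Wage)
m2 <- gam(wage ~ s(year, df = 4) + s(age, df = 5) + education, data = Wage)
anova(m1, m2, test = "F")
```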
Can use smoothing splines or local regression as well:
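```r
# s() gives a smoothing-spline term, lo() a local-regression term
# (sketch, gam package and ISLR Wage data assumed)
gam_mix <- gam(wage ~ s(year, df = 4) + lo(age, span = 0.7) + education,
               data = Wage)
```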
GAMs are additive, although low-order interactions can be included in a natural way using, e.g., bivariate smoothers or interactions of the form ns(age, df = 5):ns(year, df = 5), as sketched below
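Two illustrative sketches (gam package loaded as above; ISLR Wage data assumed):

```r
# Interaction of two spline bases inside lm()
fit_int <- lm(wage ~ ns(age, df = 5):ns(year, df = 5) + education, data = Wage)

# Bivariate local-regression smoother over (year, age) inside gam()
gam_biv <- gam(wage ~ lo(year, age, span = 0.5) + education, data = Wage)
```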